{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4fea928",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright (c) 2024 Microsoft Corporation.\n",
    "# Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c4bc9ba",
   "metadata": {},
   "source": [
    "# Neo4j Import of GraphRAG Result Parquet files\n",
    "\n",
    "This notebook imports the results of the GraphRAG indexing process into the Neo4j Graph database for further processing, analysis or visualization. \n",
    "\n",
    "You can also build your own GenAI applications using Neo4j and a number of RAG strategies with LangChain, LlamaIndex, Haystack, and many other frameworks.\n",
    "See: https://neo4j.com/labs/genai-ecosystem\n",
    "\n",
    "Here is what the end result looks like:\n",
    "\n",
    "![](https://dev.assets.neo4j.com/wp-content/uploads/graphrag-neo4j-visualization.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3924e246",
   "metadata": {},
   "source": [
    "## How does it work?\n",
    "\n",
    "The notebook loads the parquet files from the `output` folder of your indexing process and loads them into Pandas dataframes.\n",
    "It then uses a batching approach to send a slice of the data into Neo4j to create nodes and relationships and add relevant properties. The id-arrays on most entities are turned into relationships. \n",
    "\n",
    "All operations use MERGE, so they are idempotent, and you can run the script multiple times.\n",
    "\n",
    "If you need to clean out the database, you can run the following statement\n",
    "\n",
    "```cypher\n",
    "MATCH (n)\n",
    "CALL { WITH n DETACH DELETE n } IN TRANSACTIONS OF 25000 ROWS;\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "adca1803",
   "metadata": {},
   "outputs": [],
   "source": [
    "GRAPHRAG_FOLDER = \"PATH_TO_OUTPUT/artifacts\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fb27b941602401d91542211134fc71a",
   "metadata": {},
   "source": [
    "### Depedendencies\n",
    "\n",
    "We only need Pandas and the neo4j Python driver with the rust extension for faster network transport."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b57beec0",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --quiet pandas neo4j-rust-ext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "3eeee95f-e4f2-4052-94fb-a5dc8ab542ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "import pandas as pd\n",
    "from neo4j import GraphDatabase"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "307dd2f4",
   "metadata": {},
   "source": [
    "## Neo4j Installation\n",
    "\n",
    "You can create a free instance of Neo4j [online](https://console.neo4j.io). You get a credentials file that you can use for the connection credentials. You can also get an instance in any of the cloud marketplaces.\n",
    "\n",
    "If you want to install Neo4j locally either use [Neo4j Desktop](https://neo4j.com/download) or \n",
    "the official Docker image: `docker run -e NEO4J_AUTH=neo4j/password -p 7687:7687 -p 7474:7474 neo4j` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "b6c15443-4acb-4f91-88ea-4e08abaa4c29",
   "metadata": {},
   "outputs": [],
   "source": [
    "NEO4J_URI = \"neo4j://localhost\"  # or neo4j+s://xxxx.databases.neo4j.io\n",
    "NEO4J_USERNAME = \"neo4j\"\n",
    "NEO4J_PASSWORD = \"\"  # your password\n",
    "NEO4J_DATABASE = \"neo4j\"\n",
    "\n",
    "# Create a Neo4j driver\n",
    "driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70f37ab6",
   "metadata": {},
   "source": [
    "## Batched Import\n",
    "\n",
    "The batched import function takes a Cypher insert statement (needs to use the variable `value` for the row) and a dataframe to import.\n",
    "It will send by default 1k rows at a time as query parameter to the database to be inserted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "d787bf7b-ac9b-4bfb-b140-a50a3fd205c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def batched_import(statement, df, batch_size=1000):\n",
    "    \"\"\"\n",
    "    Import a dataframe into Neo4j using a batched approach.\n",
    "\n",
    "    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.\n",
    "    \"\"\"\n",
    "    total = len(df)\n",
    "    start_s = time.time()\n",
    "    for start in range(0, total, batch_size):\n",
    "        batch = df.iloc[start : min(start + batch_size, total)]\n",
    "        result = driver.execute_query(\n",
    "            \"UNWIND $rows AS value \" + statement,\n",
    "            rows=batch.to_dict(\"records\"),\n",
    "            database_=NEO4J_DATABASE,\n",
    "        )\n",
    "        print(result.summary.counters)\n",
    "    print(f\"{total} rows in {time.time() - start_s} s.\")\n",
    "    return total"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fb45f42",
   "metadata": {},
   "source": [
    "## Indexes and Constraints\n",
    "\n",
    "Indexes in Neo4j are only used to find the starting points for graph queries, e.g. quickly finding two nodes to connect.\n",
    "Constraints exist to avoid duplicates, we create them mostly on id's of Entity types.\n",
    "\n",
    "We use some Types as markers with two underscores before and after to distinguish them from the actual entity types.\n",
    "\n",
    "The default relationship type here is `RELATED` but we could also infer a real relationship-type from the description or the types of the start and end-nodes.\n",
    "\n",
    "* `__Entity__`\n",
    "* `__Document__`\n",
    "* `__Chunk__`\n",
    "* `__Community__`\n",
    "* `__Covariate__`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "ed7f212e-9148-424c-adc6-d81db9f8e5a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique\n",
      "\n",
      "create constraint document_id if not exists for (d:__Document__) require d.id is unique\n",
      "\n",
      "create constraint entity_id if not exists for (c:__Community__) require c.community is unique\n",
      "\n",
      "create constraint entity_id if not exists for (e:__Entity__) require e.id is unique\n",
      "\n",
      "create constraint entity_title if not exists for (e:__Entity__) require e.name is unique\n",
      "\n",
      "create constraint entity_title if not exists for (e:__Covariate__) require e.title is unique\n",
      "\n",
      "create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique\n"
     ]
    }
   ],
   "source": [
    "# create constraints, idempotent operation\n",
    "\n",
    "statements = [\n",
    "    \"\\ncreate constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique\",\n",
    "    \"\\ncreate constraint document_id if not exists for (d:__Document__) require d.id is unique\",\n",
    "    \"\\ncreate constraint entity_id if not exists for (c:__Community__) require c.community is unique\",\n",
    "    \"\\ncreate constraint entity_id if not exists for (e:__Entity__) require e.id is unique\",\n",
    "    \"\\ncreate constraint entity_title if not exists for (e:__Entity__) require e.name is unique\",\n",
    "    \"\\ncreate constraint entity_title if not exists for (e:__Covariate__) require e.title is unique\",\n",
    "    \"\\ncreate constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique\",\n",
    "    \"\\n\",\n",
    "]\n",
    "\n",
    "for statement in statements:\n",
    "    if len((statement or \"\").strip()) > 0:\n",
    "        print(statement)\n",
    "        driver.execute_query(statement)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beea073b",
   "metadata": {},
   "source": [
    "## Import Process\n",
    "\n",
    "### Importing the Documents\n",
    "\n",
    "We're loading the parquet file for the documents and create nodes with their ids and add the title property.\n",
    "We don't need to store text_unit_ids as we can create the relationships and the text content is also contained in the chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "1ba023e7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c305886e4aa2f6efcf64b57762777055</td>\n",
       "      <td>book.txt</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 id     title\n",
       "0  c305886e4aa2f6efcf64b57762777055  book.txt"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/documents.parquet\", columns=[\"id\", \"title\"]\n",
    ")\n",
    "doc_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "96391c15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'labels_added': 1, 'nodes_created': 1, 'properties_set': 2}\n",
      "1 rows in 0.05211496353149414 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import documents\n",
    "statement = \"\"\"\n",
    "MERGE (d:__Document__ {id:value.id})\n",
    "SET d += value {.title}\n",
    "\"\"\"\n",
    "\n",
    "batched_import(statement, doc_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f97bbadb",
   "metadata": {},
   "source": [
    "### Loading Text Units\n",
    "\n",
    "We load the text units, create a node per id and set the text and number of tokens.\n",
    "Then we connect them to the documents that we created before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "0d825626",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>n_tokens</th>\n",
       "      <th>document_ids</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>680dd6d2a970a49082fa4f34bf63a34e</td>\n",
       "      <td>﻿The Project Gutenberg eBook of A Christmas Ca...</td>\n",
       "      <td>300</td>\n",
       "      <td>[c305886e4aa2f6efcf64b57762777055]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>95f1f8f5bdbf0bee3a2c6f2f4a4907f6</td>\n",
       "      <td>THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL...</td>\n",
       "      <td>300</td>\n",
       "      <td>[c305886e4aa2f6efcf64b57762777055]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 id  \\\n",
       "0  680dd6d2a970a49082fa4f34bf63a34e   \n",
       "1  95f1f8f5bdbf0bee3a2c6f2f4a4907f6   \n",
       "\n",
       "                                                text  n_tokens  \\\n",
       "0  ﻿The Project Gutenberg eBook of A Christmas Ca...       300   \n",
       "1   THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL...       300   \n",
       "\n",
       "                         document_ids  \n",
       "0  [c305886e4aa2f6efcf64b57762777055]  \n",
       "1  [c305886e4aa2f6efcf64b57762777055]  "
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/text_units.parquet\",\n",
    "    columns=[\"id\", \"text\", \"n_tokens\", \"document_ids\"],\n",
    ")\n",
    "text_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "ffd3d380-8710-46f5-b90a-04ed8482192c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'relationships_created': 231, 'properties_set': 462}\n",
      "231 rows in 0.05993008613586426 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "231"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "statement = \"\"\"\n",
    "MERGE (c:__Chunk__ {id:value.id})\n",
    "SET c += value {.text, .n_tokens}\n",
    "WITH c, value\n",
    "UNWIND value.document_ids AS document\n",
    "MATCH (d:__Document__ {id:document})\n",
    "MERGE (c)-[:PART_OF]->(d)\n",
    "\"\"\"\n",
    "\n",
    "batched_import(statement, text_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f01b2094",
   "metadata": {},
   "source": [
    "### Loading Nodes\n",
    "\n",
    "For the nodes we store id, name, description, embedding (if available), human readable id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "2392f9e9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>type</th>\n",
       "      <th>description</th>\n",
       "      <th>human_readable_id</th>\n",
       "      <th>id</th>\n",
       "      <th>description_embedding</th>\n",
       "      <th>text_unit_ids</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\"PROJECT GUTENBERG\"</td>\n",
       "      <td>\"ORGANIZATION\"</td>\n",
       "      <td>Project Gutenberg is a pioneering organization...</td>\n",
       "      <td>0</td>\n",
       "      <td>b45241d70f0e43fca764df95b2b81f77</td>\n",
       "      <td>[-0.020793898031115532, 0.02951139025390148, 0...</td>\n",
       "      <td>[01e84646075b255eab0a34d872336a89, 10bab8e9773...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\"UNITED STATES\"</td>\n",
       "      <td>\"GEO\"</td>\n",
       "      <td>The United States is prominently recognized fo...</td>\n",
       "      <td>1</td>\n",
       "      <td>4119fd06010c494caa07f439b333f4c5</td>\n",
       "      <td>[-0.009704762138426304, 0.013335365802049637, ...</td>\n",
       "      <td>[01e84646075b255eab0a34d872336a89, 28f242c4515...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  name            type  \\\n",
       "0  \"PROJECT GUTENBERG\"  \"ORGANIZATION\"   \n",
       "1      \"UNITED STATES\"           \"GEO\"   \n",
       "\n",
       "                                         description  human_readable_id  \\\n",
       "0  Project Gutenberg is a pioneering organization...                  0   \n",
       "1  The United States is prominently recognized fo...                  1   \n",
       "\n",
       "                                 id  \\\n",
       "0  b45241d70f0e43fca764df95b2b81f77   \n",
       "1  4119fd06010c494caa07f439b333f4c5   \n",
       "\n",
       "                               description_embedding  \\\n",
       "0  [-0.020793898031115532, 0.02951139025390148, 0...   \n",
       "1  [-0.009704762138426304, 0.013335365802049637, ...   \n",
       "\n",
       "                                       text_unit_ids  \n",
       "0  [01e84646075b255eab0a34d872336a89, 10bab8e9773...  \n",
       "1  [01e84646075b255eab0a34d872336a89, 28f242c4515...  "
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entity_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/entities.parquet\",\n",
    "    columns=[\n",
    "        \"name\",\n",
    "        \"type\",\n",
    "        \"description\",\n",
    "        \"human_readable_id\",\n",
    "        \"id\",\n",
    "        \"description_embedding\",\n",
    "        \"text_unit_ids\",\n",
    "    ],\n",
    ")\n",
    "entity_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "1d038114-0714-48ee-a48a-c421cd539661",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'properties_set': 831}\n",
      "277 rows in 0.6978070735931396 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "277"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entity_statement = \"\"\"\n",
    "MERGE (e:__Entity__ {id:value.id})\n",
    "SET e += value {.human_readable_id, .description, name:replace(value.name,'\"','')}\n",
    "WITH e, value\n",
    "CALL db.create.setNodeVectorProperty(e, \"description_embedding\", value.description_embedding)\n",
    "CALL apoc.create.addLabels(e, case when coalesce(value.type,\"\") = \"\" then [] else [apoc.text.upperCamelCase(replace(value.type,'\"',''))] end) yield node\n",
    "UNWIND value.text_unit_ids AS text_unit\n",
    "MATCH (c:__Chunk__ {id:text_unit})\n",
    "MERGE (c)-[:HAS_ENTITY]->(e)\n",
    "\"\"\"\n",
    "\n",
    "batched_import(entity_statement, entity_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "018d4f87",
   "metadata": {},
   "source": [
    "### Import Relationships\n",
    "\n",
    "For the relationships we find the source and target node by name, using the base `__Entity__` type.\n",
    "After creating the `RELATED` relationships, we set the description as attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "b347a047",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>id</th>\n",
       "      <th>rank</th>\n",
       "      <th>weight</th>\n",
       "      <th>human_readable_id</th>\n",
       "      <th>description</th>\n",
       "      <th>text_unit_ids</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\"PROJECT GUTENBERG\"</td>\n",
       "      <td>\"A CHRISTMAS CAROL\"</td>\n",
       "      <td>b84d71ed9c3b45819eb3205fd28e13a0</td>\n",
       "      <td>20</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>\"Project Gutenberg is responsible for releasin...</td>\n",
       "      <td>[680dd6d2a970a49082fa4f34bf63a34e]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\"PROJECT GUTENBERG\"</td>\n",
       "      <td>\"SUZANNE SHELL\"</td>\n",
       "      <td>b0b464bc92a541e48547fe9738378dab</td>\n",
       "      <td>15</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>\"Suzanne Shell produced the eBook version of '...</td>\n",
       "      <td>[680dd6d2a970a49082fa4f34bf63a34e]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                source               target                                id  \\\n",
       "0  \"PROJECT GUTENBERG\"  \"A CHRISTMAS CAROL\"  b84d71ed9c3b45819eb3205fd28e13a0   \n",
       "1  \"PROJECT GUTENBERG\"      \"SUZANNE SHELL\"  b0b464bc92a541e48547fe9738378dab   \n",
       "\n",
       "   rank  weight human_readable_id  \\\n",
       "0    20     1.0                 0   \n",
       "1    15     1.0                 1   \n",
       "\n",
       "                                         description  \\\n",
       "0  \"Project Gutenberg is responsible for releasin...   \n",
       "1  \"Suzanne Shell produced the eBook version of '...   \n",
       "\n",
       "                        text_unit_ids  \n",
       "0  [680dd6d2a970a49082fa4f34bf63a34e]  \n",
       "1  [680dd6d2a970a49082fa4f34bf63a34e]  "
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rel_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/relationships.parquet\",\n",
    "    columns=[\n",
    "        \"source\",\n",
    "        \"target\",\n",
    "        \"id\",\n",
    "        \"rank\",\n",
    "        \"weight\",\n",
    "        \"human_readable_id\",\n",
    "        \"description\",\n",
    "        \"text_unit_ids\",\n",
    "    ],\n",
    ")\n",
    "rel_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "27900c01-89e1-4dec-9d5c-c07317c68baf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'properties_set': 1710}\n",
      "342 rows in 0.14740705490112305 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "342"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rel_statement = \"\"\"\n",
    "    MATCH (source:__Entity__ {name:replace(value.source,'\"','')})\n",
    "    MATCH (target:__Entity__ {name:replace(value.target,'\"','')})\n",
    "    // not necessary to merge on id as there is only one relationship per pair\n",
    "    MERGE (source)-[rel:RELATED {id: value.id}]->(target)\n",
    "    SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}\n",
    "    RETURN count(*) as createdRels\n",
    "\"\"\"\n",
    "\n",
    "batched_import(rel_statement, rel_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6365dd7",
   "metadata": {},
   "source": [
    "### Importing Communities\n",
    "\n",
    "For communities we import their id, title, level.\n",
    "We connect the `__Community__` nodes to the start and end nodes of the relationships they refer to.\n",
    "\n",
    "Connecting them to the chunks they orignate from is optional, as the entites are already connected to the chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "c2fab66c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>level</th>\n",
       "      <th>title</th>\n",
       "      <th>text_unit_ids</th>\n",
       "      <th>relationship_ids</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>Community 2</td>\n",
       "      <td>[0546d296a4d3bb0486bd0c94c01dc9be,0d6bc6e701a0...</td>\n",
       "      <td>[ba481175ee1d4329bf07757a30abd3a1, 8d8da35190b...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>Community 4</td>\n",
       "      <td>[054bdcba0a3690b43609d9226a47f84d,3a450ed2b7fb...</td>\n",
       "      <td>[929f30875e1744b49e7b416eaf5a790c, 4920fda0318...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  id  level        title                                      text_unit_ids  \\\n",
       "0  2      0  Community 2  [0546d296a4d3bb0486bd0c94c01dc9be,0d6bc6e701a0...   \n",
       "1  4      0  Community 4  [054bdcba0a3690b43609d9226a47f84d,3a450ed2b7fb...   \n",
       "\n",
       "                                    relationship_ids  \n",
       "0  [ba481175ee1d4329bf07757a30abd3a1, 8d8da35190b...  \n",
       "1  [929f30875e1744b49e7b416eaf5a790c, 4920fda0318...  "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "community_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/communities.parquet\",\n",
    "    columns=[\"id\", \"level\", \"title\", \"text_unit_ids\", \"relationship_ids\"],\n",
    ")\n",
    "\n",
    "community_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "1351f7e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'properties_set': 94}\n",
      "47 rows in 0.07877922058105469 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "47"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "statement = \"\"\"\n",
    "MERGE (c:__Community__ {community:value.id})\n",
    "SET c += value {.level, .title}\n",
    "/*\n",
    "UNWIND value.text_unit_ids as text_unit_id\n",
    "MATCH (t:__Chunk__ {id:text_unit_id})\n",
    "MERGE (c)-[:HAS_CHUNK]->(t)\n",
    "WITH distinct c, value\n",
    "*/\n",
    "WITH *\n",
    "UNWIND value.relationship_ids as rel_id\n",
    "MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)\n",
    "MERGE (start)-[:IN_COMMUNITY]->(c)\n",
    "MERGE (end)-[:IN_COMMUNITY]->(c)\n",
    "RETURn count(distinct c) as createdCommunities\n",
    "\"\"\"\n",
    "\n",
    "batched_import(statement, community_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd9adf50",
   "metadata": {},
   "source": [
    "### Importing Community Reports\n",
    "\n",
    "Fo the community reports we create nodes for each communitiy set the id, community, level, title, summary, rank, and rank_explanation and connect them to the entities they are about.\n",
    "For the findings we create the findings in context of the communities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "1be9e7a9-69ee-406b-bce5-95a9c41ecffe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>community</th>\n",
       "      <th>level</th>\n",
       "      <th>title</th>\n",
       "      <th>summary</th>\n",
       "      <th>findings</th>\n",
       "      <th>rank</th>\n",
       "      <th>rank_explanation</th>\n",
       "      <th>full_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>e7822326-4da8-4954-afa9-be7f4f5791a5</td>\n",
       "      <td>42</td>\n",
       "      <td>2</td>\n",
       "      <td>Scrooge's Supernatural Encounters: Marley's Gh...</td>\n",
       "      <td>This report delves into the pivotal supernatur...</td>\n",
       "      <td>[{'explanation': 'Marley's Ghost plays a cruci...</td>\n",
       "      <td>8.0</td>\n",
       "      <td>The impact severity rating is high due to the ...</td>\n",
       "      <td># Scrooge's Supernatural Encounters: Marley's ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8a5afac1-99ef-4f01-a1b1-f044ce392ff9</td>\n",
       "      <td>43</td>\n",
       "      <td>2</td>\n",
       "      <td>The Ghost's Influence on Scrooge's Transformation</td>\n",
       "      <td>This report delves into the pivotal role of 'T...</td>\n",
       "      <td>[{'explanation': 'The Ghost, identified at tim...</td>\n",
       "      <td>8.5</td>\n",
       "      <td>The impact severity rating is high due to the ...</td>\n",
       "      <td># The Ghost's Influence on Scrooge's Transform...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id community  level  \\\n",
       "0  e7822326-4da8-4954-afa9-be7f4f5791a5        42      2   \n",
       "1  8a5afac1-99ef-4f01-a1b1-f044ce392ff9        43      2   \n",
       "\n",
       "                                               title  \\\n",
       "0  Scrooge's Supernatural Encounters: Marley's Gh...   \n",
       "1  The Ghost's Influence on Scrooge's Transformation   \n",
       "\n",
       "                                             summary  \\\n",
       "0  This report delves into the pivotal supernatur...   \n",
       "1  This report delves into the pivotal role of 'T...   \n",
       "\n",
       "                                            findings  rank  \\\n",
       "0  [{'explanation': 'Marley's Ghost plays a cruci...   8.0   \n",
       "1  [{'explanation': 'The Ghost, identified at tim...   8.5   \n",
       "\n",
       "                                    rank_explanation  \\\n",
       "0  The impact severity rating is high due to the ...   \n",
       "1  The impact severity rating is high due to the ...   \n",
       "\n",
       "                                        full_content  \n",
       "0  # Scrooge's Supernatural Encounters: Marley's ...  \n",
       "1  # The Ghost's Influence on Scrooge's Transform...  "
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "community_report_df = pd.read_parquet(\n",
    "    f\"{GRAPHRAG_FOLDER}/community_reports.parquet\",\n",
    "    columns=[\n",
    "        \"id\",\n",
    "        \"community\",\n",
    "        \"level\",\n",
    "        \"title\",\n",
    "        \"summary\",\n",
    "        \"findings\",\n",
    "        \"rank\",\n",
    "        \"rank_explanation\",\n",
    "        \"full_content\",\n",
    "    ],\n",
    ")\n",
    "community_report_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "5c6ed591-f98c-4403-9fde-8d4cb4c01cca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'properties_set': 729}\n",
      "47 rows in 0.02472519874572754 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "47"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import communities\n",
    "community_statement = \"\"\"\n",
    "MERGE (c:__Community__ {community:value.community})\n",
    "SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}\n",
    "WITH c, value\n",
    "UNWIND range(0, size(value.findings)-1) AS finding_idx\n",
    "WITH c, value, finding_idx, value.findings[finding_idx] as finding\n",
    "MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})\n",
    "SET f += finding\n",
    "\"\"\"\n",
    "batched_import(community_statement, community_report_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50a1a24a",
   "metadata": {},
   "source": [
    "### Importing Covariates\n",
    "\n",
    "Covariates are for instance claims on entities, we connect them to the chunks where they originate from."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "523bed92-d12c-4fc4-aa44-6c62321b36bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "cov_df = (pd.read_parquet(f\"{GRAPHRAG_FOLDER}/covariates.parquet\"),)\n",
    "#                         columns=[\"id\",\"text_unit_id\"])\n",
    "cov_df.head(2)\n",
    "# Subject id do not match entity ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e064234-5fce-448e-8bb4-ab2f35699049",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'_contains_updates': True, 'labels_added': 89, 'relationships_created': 89, 'nodes_created': 89, 'properties_set': 1061}\n",
      "89 rows in 0.13370895385742188 s.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "89"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import covariates\n",
    "cov_statement = \"\"\"\n",
    "MERGE (c:__Covariate__ {id:value.id})\n",
    "SET c += apoc.map.clean(value, [\"text_unit_id\", \"document_ids\", \"n_tokens\"], [NULL, \"\"])\n",
    "WITH c, value\n",
    "MATCH (ch:__Chunk__ {id: value.text_unit_id})\n",
    "MERGE (ch)-[:HAS_COVARIATE]->(c)\n",
    "\"\"\"\n",
    "batched_import(cov_statement, cov_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00340bae",
   "metadata": {},
   "source": [
    "### Visualize your data\n",
    "\n",
    "You can now [Open] Neo4j on Aura, you need to log in with either SSO or your credentials.\n",
    "\n",
    "Or open https://workspace-preview.neo4j.io and connect to your local instance, remember the URI is `neo4j://localhost` and `neo4j` as username and `password` as password.\n",
    "\n",
    "In \"Explore\" you can explore by using visual graph patterns and then explore and expand further.\n",
    "\n",
    "In \"Query\", you can open the left sidebar and explore by clicking on the nodes and relationships.\n",
    "You can also use the co-pilot to generate Cypher queries for your, here are some examples.\n",
    "\n",
    "#### Show a few `__Entity__` nodes and their relationships (Entity Graph)\n",
    "\n",
    "```cypher\n",
    "MATCH path = (:__Entity__)-[:RELATED]->(:__Entity__)\n",
    "RETURN path LIMIT 200\n",
    "```\n",
    "\n",
    "#### Show the Chunks and the Document (Lexical Graph)\n",
    "\n",
    "```cypher\n",
    "MATCH (d:__Document__) WITH d LIMIT 1\n",
    "MATCH path = (d)<-[:PART_OF]-(c:__Chunk__)\n",
    "RETURN path LIMIT 100\n",
    "```\n",
    "\n",
    "####  Show a Community and it's Entities\n",
    "\n",
    "```cypher\n",
    "MATCH (c:__Community__) WITH c LIMIT 1\n",
    "MATCH path = (c)<-[:IN_COMMUNITY]-()-[:RELATED]-(:__Entity__)\n",
    "RETURN path LIMIT 100\n",
    "```\n",
    "\n",
    "#### Show everything\n",
    "\n",
    "```cypher\n",
    "MATCH (d:__Document__) WITH d LIMIT 1\n",
    "MATCH path = (d)<-[:PART_OF]-(:__Chunk__)-[:HAS_ENTIY]->()-[:RELATED]-()-[:IN_COMMUNITY]->()\n",
    "RETURN path LIMIT 250\n",
    "```\n",
    "\n",
    "We showed the visualization of this last query at the beginning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0aa8529",
   "metadata": {},
   "source": [
    "If you have questions, feel free to reach out in the GraphRAG discord server: \n",
    "https://discord.gg/graphrag"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
